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Abstract 

A long standing issue in population genetics is the identification of geneti- 
cally homogeneous populations. The most widely used measures of popula- 
tion structure are Wright's F statistics (Wright 1931). But the fundamental 
prerequisite of any inference based on these statistics is the definition of popu- 
lations and this definition is typically subjective (based on linguistic, cultural 
or physical characters, geographical location). The population structure may 
be difficult to detect using visible characters. 

We propose a Model-Based Clustering (MBC) method combined with loci 
selection using multi-allelic loci data. The loci selection problem is regarded 
as a model selection problem and models in competition are compared with 
the Bayesian Information Criterion (BIC). The resulting procedure selects 
the subset S n of clustering variables, the number K n of clusters, estimates 
the proportion of each population and the allelic frequencies within each 

cluster. We prove that the selected model (^K n , Snj converges in probability 

to the true model (Kq, Sq) under a single realistic assumption as the size n of 
the sample tends to infinity. The proposed algorithm named MixMoGenD 
('Mixture Model for Genetic Data') has been implemented using C + + and 
C programming languages. An interface with R was created. Numerical 
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experiments on simulated data sets was conducted to highlight the interest 
of the proposed loci selection procedure. 

Key words: Model-Based Clustering, Model Selection, Variable Selection, 
BIC, Population Genetics 



1. Introduction 

Evolutionary processes have produced an immense array of biological di- 
versity on our planet, with species displaying complex adaptations to their 
environments. Understanding this diversity and complexity, its origins, and 
its implications is a daunting challenge. Population genetics provides tools to 
meet this challenge. It is concerned with origin, amount, and distribution of 
genetic variation present in populations of organisms and fate this variation 
through space and time. A long standing issue in population genetics is the 
identification of genetically homogeneous populations. The most widely used 
measures of population structure are Wright's F statistics (Wright 1931). But 
the fundamental prerequisite of any inference based on these statistics is the 
definition of populations and this definition is typically subjective, based 
on linguistic, cultural or physical characters, or geographic location. The 
population structure may be difficult to detect using visible characters. 

This article is concerned with population structure that is difficult to 
detect using visible characters (such as linguistic, cultural, physical charac- 
ters, or geographic location), but may be significant in genetic terms. For 
example, the problem of cryptic population arises in the context of DNA 
fingerprinting for forensics, where it is important to assess the degree of pop- 
ulation structure to estimate the probability of false matches (D. J. Balding 
and Nichols RA 1994 et 1995 Q, Q, 8; Foreman et al. (1997) @). 

Several Model Based-Clustering (MBC) for multi-locus genetic data have 
been developped in recent years: STRUCTURE by J. K. Pritchard et al. 
(2000) 0, BAPS by J. Corander. et al. (2004) 0, FASTRUCT by 
Olivier Franois et al. (2006) [10]. These methods attempt to group sam- 
ples into clusters of random mating individuals so that the Hardy- Weinberg 
(HW) and linkage disequilibria (LD) are minimized across the sample. Al- 
though Hardy- Weinberg and linkage equilibria models are based on several 
simplifying assumptions that can be unrealistic, they have still proven to be 
useful in describing many population genetics attributes and will serve as a 
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useful base model in the development of more realistic models of microevo- 
lution. STRUCTURE and BAPS are bayesian methods that use MCMC 
algorithms and thus require much longer computations than frequentist like- 
lihood methods using Expectation-Maximization (EM) algorithm [8j. 

Multi-locus data sets are becoming increasingly large due to the explosion 
of genomic projects. But, the structure of interest may be contained in only 
a subset of available loci, the others being useless or even harmful to detect 
a reasonable clustering structure. It then becomes necessary to select the 
optimum subset So of loci which cluster in the best way the population. None 
of the methods cited above include a loci selection procedure. Defining the 
optimum set Sq of loci and optimum number K of subpopulations requires 
a suitable variable selection procedure. 

In this article, we propose a new clustering method and an associated 
algorithm named Mixture Model for Genetic Data (MixMoGenD) that has 
three benefits. First, it is model-based. Second, it is based on EM algorithm, 



so it is relatively fast compared to its counterparts based on MCMC [10 
The main benefit of our proposed method is that it is coupled with a loci 
selection procedure based on Bayesian Information Criterion (BIC) and on 
a backward stepwise method. Recall that in clustering, classification is not 
observed and there is no a priori knowledge of the structure being looked 
for in the analysis, and of the subset of available loci that are relevant for 
discrimination. So there is no simple pre-analysis screening technic available 
to use. Thus it makes sense to include loci selection procedure as a part 
of the clustering algorithm as recommended by C. Maugis et al. (2007) 



14| in a gaussian framework. The resulting procedure selects the subset S 



of clustering variables, the number K of clusters, estimates proportion and 
allelic frequencies within each cluster. 

We recast loci selection problem and the estimation of the number of 
clusters as a model selection problem. Bayes factors, the ratio of integrated 
likelihood for models, are used to compare models, so that the models to be 
compared can be non-nested. Since integrated likelihood is usually difficult to 
compute, the Bayesian Information Criterion (BIC) for the compe ting models 
is used to approximate the log-likelihood. Raftery et al. (2006) [16J showed 
that compared to clustering methods based on all variables, variable selec- 
tion method based on the BIC consistently yielded more accurate estimates 
of the number of clusters in a gaussian context. The proposed method can 
be applied to various types of markers (e.g. microsatellites, restriction frag- 
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ment length polymorphisms (RFLPs), or single nucleotide polymorphisms 
(SNPs)). 

The model and methods are presented in section [2J The consistency of 
the estimators of the number of populations and the set of loci relevant for 
discrimination is proved in Section [3] under a single realistic assumption. The 
proposed algorithm has been implemented using C + + and C programming 
languages. In Section [U Numerical experiments on simulated data sets was 
conducted to highlight the interest of the proposed loci selection procedure. 

The program, sample project files and their simulation parameters, and 
documentation for linux OS are available free of charge at : 
http : / /www . math . u-psud/~toussilej 

2. Model and methods 

In this section, we present the model and our clustering method using loci 
selection based on the Bayesian Information Criterion. The loci selection 
procedure is presented in sub-section \'2.2\ the identifiability of models in 
competition and of parameters are discussed in sub-section 12.4} and the EM 
equations are given in sub-section 12.31 

2.1. Notations and estimation method 

The data set we shall deal with consists of genotypes of n diploid in- 
dividuals labeled 1, . . . , i, . . . , n at L loci labeled 1, . . . , I, . . . , L. The 

observations are written as x = (xlj.^ n . z _ 1 L , where x\ = {x-^, x l i2 } is 

the genotype of the i th individual at the I th locus. The data set x is assumed 

to be a realization of a random vector X = [Xf] . , , , , , where X) = 

\ i/i=i,. ..,n, ;=i,...,L' * 

\X\ X , X- 2 }, with X\ x and X\ 2 taking values in the set {1, . . . , I, . . ., Ai}, 
and 1, . . . , j, . . . , Ai denote the labels of distinct alleles that are observed 
at locus I. We assume that the variables Xi = (X|), =1 L , i = 1, . . . , n are 
independent and identically distributed. 
Let : 

• Zi be the (unobserved) population of origin of individual i; 

• Tik '■= P = k) be the proportion of population k; 

• a k) i ;j := P [X\ tl = j\ Zi = k) = P (X l i2 = j\ z-i = k) be the frequency 
of the j th allele at locus I in population k; 
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• X be the set of possible genotypes from observed alleles; 

and let z = (z 1 , . . . , z n ), n = ir K ) and a = (^k,i,j) k =i,...,K- i=i,...,l- j=i,...,A, 

The 7Tfc's are called the mixing proportions and represent the prior probability 
of an individual coming from each population k. 

Our main modeling assumptions are 

(Tii )'■ Hardy- Weinberg Equilibrium (HWE) within populations and 
(7i 2 ): complete Linkage Equilibrium (LE) within populations. 

Model-based methods proceed by assuming that observations from each 
cluster are drawn from some parametric model and the overall population is a 
finite mixture of these populations. Thus, without loci selection, observations 
x — (xi, . . . , x n ) are supposed to be a sample from the probability distribu- 
tion with the likelihood contribution of individual % given by the following 
equation 



K 

P K ( Xi \ 9) :=P( Xi \ K, 8) = J2* k 



Y[P (x\\ Zi = k, a k: i t .) 



.1=1 



(1) 



where 9 = (ir, a) is a parameter ranging in a certain space Qk, for a given 
number K of populations. 



In this model of probability distributions, all the L loci are supposed to 
be relevant for clustering. Now, the structure of interest may be contained in 
only a subset S of available loci, the others being useless or even harmful to 
detect a reasonable clustering structure. Let S c be the subset of loci that are 
irrelevant for clustering (SU S c = {1, . . . , L}). The natural third hypothesis 
is the following. 

(H3): the alleles of the loci of S c are identically distributed in the overall 
population, i.e 

= a 2 ,ij = ... = a K ,i,j =■ V/ G S c and Vj E {1, . . . , A t } . 

(2) 

The allele frequencies are given by the Hardy- Weinberg model: 

P (x\\ Zi = k, a M> .) = (2 - \i ii=x i i2 ]j <x k ,i, x \ A x «m,< 2 - ( 3 ) 
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Although this model makes several simplifying assumptions that are unreal- 
istic in some cases, it has still proven to be useful in describing many popu- 
lation genetics attributes and will serve as a first tool in the development of 
more realistic models of microevolution. 

Under the three assumptions (Hi), (H2) and (H3), and given the number 
K of populations and the subset S of relevant loci, the observations are 
supposed to be realizations of a sample from a probability distribution of the 
form 



P{K,s)(xi\ 9) := P( Xi \ K, S, 9) 



K 



^2^kY[p (4\ z i = k > a k,i, •) 



k=i les 



X 



ies c 



where 9 := (tt, (cx.j, .) leS , (A, -)ies c ) * s a multidimensional parameter rang- 
ing over some space Q(k, s)- Each individual is assumed to originate in one 
of the K (unknown) populations, each with its own allele frequencies. Thus 
these parameters verify the following properties : 



< T\k < 1, k = 1, 
Ef=i = 1. 



< a h , i, < 1, fc = 1, . . . , X, ieS, a = 1, . . . , A,; 

Eii a*, i, o = 1, *; = 1, . . . , ^, / = 1, . . . , l. 

0<Pi, a <l, leS c , a = l,..., A i; 



(5) 

(6) 
(7) 



The number K of populations, the subset S of relevant loci for clus- 
tering, the proportions 71 of populations, the allele frequencies a and (5 := 
(fii,j)i£S c - 3=1 A t are treated as the parameters of the model, which have to 
be infered. The variable Zi, the assignment of individual % to its population 
is not observed and has to be predicted. 

Infering on K and S is regarded as a model selection problem. In fact, 
each value of (K, S) defines a parametric model 

M {k ,s) = {P { k,s)(-\9); 9ee (K , S )} 
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of probability distributions. Let K max be the maximum number of clusters. 
i^ max has to be specified by the user for identifiability purposes discussed in 
the sub-section 12.41 hereafter. Let us consider the collection C of competing 
models: 

C = {M {K , S ) ■ Ke{l,..., K max } and S G V* (L)} , (8) 

where V* (L) is the set of non-empty subsets of the available loci {1, . . . , L}. 
In a Bayesian framework, the model M.(^ § ^ maximizing the posterior 

probability is to be chosen : 

[K n , S n ^j = arg max P [(K, S)\ x}. (9) 

By Bayes Theorem and assuming a non informative uniform prior distribu- 
tion P [(K, S)] on the competing models M.{k, s), K = 1, ... , K max , S G 
V* (L), one has 

(K n , S n ) =argmaxP[x| (K, S)} . (10) 

The quantity P[x\ (K, S)] is the integrated likelihood of model M.(k, s), 
namely 

P[x\ (K, S)]= [ (f[PiK,s){xi\0))p[0\ (K, S)]d0, (11) 

Jeee {Ky s) V i=1 J 



where d6 is a measure on the parameter space Oik, s) and P[0\ (K, S)], 
the prior distribution (Kass and Raftery 1995 12|). This integrated like- 
lihood is analytically difficult to compute. An asymptotic approximation 
of 21nP[x| (K, S)} is generally used; this approximation is the Bayesian 
Information Criterion (BIC) defined by 

n 

BIC(K, S) = 2J2^P(K, s) (xi\ e M L,(K,S)) ~ d {K> 5) lnn, (12) 



i=i 



where d(x, s) is the dimension of the parameter space Q(k, s) an d @ml,(k,s), 

ximi 
is given by 

[K n , S n ) = arg 

(K, s) 



the maximum likelihood estimate of 9 in Q(k, s)- Thus, the selected model 
(2 n , S n ) = arg max BIC (K, S) . (13) 



The maximum likelihood estimate 9 ML $ \ yields the Maximum a 
Posteriori (MAP) prediction rule defined by 

?i = arg fce {f aX j?} H Zi = k ' ® ML (^> &)) " 

One can notice that 6 ml, (k, s) = (jML, (k, s), Pml,(k,s)^ , where 7 = (vr, a). 
The maximum likelihood estimate jml,(k,s) is computed usingthe Expec- 
tation Maximization (EM) algorithm (Dempster et al. (1977) [8|), and the 
likelihood estimate (3ml,{k,s) is given by the observed frequencies of the alleles 
of the loci of S c . 

Before giving the EM equations, let us present the loci selection proce- 
dure. 

2.2. Combined Loci selection and clustering procedure 

The space of competing models can be very large, consisting of all com- 
binations of all (2 L — l) non-empty subsets of the available loci with each 
possible number of populations. Thus an exhaustive research of an opti- 
mum model is very painful in most situations. We adopt a two nested-step 
algorithm as proposed by C. Maugis et al. (2007) [3] : 

Step 1. For all K G {1, . . . , K max }, we reseach 

S n (K) = arg max BIG (K, S) (15) 

sev*(L) 

by a backward stepwise procedure detailed hereafter. 
Step 2. We determine 

K n = arg max BIC (k, S n (K)) . (16) 

K&{\,..., _fsT max } V / 



We prefer a backward stepwise procedure rather than a forward stepwise 
as in Kass and Raftery (1995) [lij because starting the selection algorithm 
with all loci included allows the model to take loci interactions into account. 
At each stage, the algorithm searches for a locus to remove, and then assesses 
whether one of the current irrelevant loci can be selected. Thus the algorithm 
is making use of an exclusion and an inclusion procedures described hereafter. 
The decision of excluding or including a locus from the set of clustering loci 
is based on the BIC approximation of the Bayes factor. 
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Backward Stepwise selection procedure. 

1 - Initialisation : S = {1, . . . , L}, S c = 0. 

2 - Exclusion step : The proposed locus for removal from the currently 

selected clustering loci S is chosen to be the one from this set without 
which the model is the best among the models with §S — 1 loci. 

c ex = axgmaxBIC(K,S\{l}). (17) 

l£.S 

This candidate c ex is excluded if the model (K, S \ {c ex }) is better 
than (K, S), i.e. 

BIC (K, S) - BIC (K, S \ {c ex }) < 0. (18) 



3 - Inclusion step : The proposed new clustering locus for inclusion in the 
currently selected clustering loci set S is chosen to be the one from 
the set S c of currently non-selected loci which shows most evidence of 
multivariate clustering including the previous selected loci. This locus 
is accepted as relevant for clustering if its evidence for clustering is 
stronger than not clustering, namely 

BIC (K, S U {c in }) - BIC (K, S) > 0, (19) 

where 

c in = arg max BIC (K, S U {/}) . (20) 

l€.S c 

The algorithm repeats 2 and 3 and stops when the proposed candidate 
for inclusion is the locus removed in the previous step or when S c is empty. 

2. 3. EM equations 

Here we describe the EM equations. To assign individual % to a cluster, 
we compute the posterior assignment probabilities = P (zi — k\ Xi). Here- 
after, we write 7^) = (tv^, a' r ') for the estimate of 7 = (n, a) at iteration 

(r) 

r of the EM algorithm. The r ifc can be describe as 

(r) = 4 ) n l esP{*u*i = k, 41 .) 
9 



(21) 



Then the update formulae for the parameters can be derived using the stan- 
dard method of the EM algorithm 



r (r+l) _ £ Jr) 



*r-->T^ (22) 



and 



En 
i=l 



T, 



(r) 
ik 



(.4-1) = V ^-"J L-'.»-'J/ , (23 ) 

z Z^i=i 'ifc 

When applying the EM algorithm to data, we need to provide values for 
. There are mainly two types of initialization methods : random ini- 
tialization methods and clustering-based initialization methods (McLachlan 
& Peel 2000). The random initialization methods assign individuals into 
clusters randomly, while the clustering-based initialization methods assign 
individuals into clusters according to some distance criteria. 

2-4- Identifiability 

Identifiability of models is necessary to have statistical consistency, which 
is a minimal requirement for an inference method. The parameters are of 
two types: (K, S) on one hand, and (tt, a, (3) on the other hand for a 
given (K, S). The identifiability of the parameter (K, S) is discussed in 
sub-section 12.4.11 For a given (K, S), the parameter /3 is always identifiable. 
The identifiability of the parameter (tt, a) is discussed in sub-section 12.4.21 
using the results of Elizabeth Allan et al. [1] which is given here in a multi- 
allelic multilocus genomic data framework. 

2.4-1- Identifiability of the parameters (K , S) 

Let V = \J, K S n M.{k, s) De the set of all probability distributions defined 
by the models M.(k, s) i n competition. We assume that the true probability 
distribution Po of the observations that we are dealing with is an element of 
V. 

For a given PeP, let us define K (P) and S (P) as follow. 
Definition 2.1. For every P inT>, 

K (P) = min [K : P G M {K , ■)} > (24) 

K 



10 



min {S : P G M { 5)} 

s 



(25) 



S(P) 

where M( K , •) = Us-A^CRT, 5) and ., 5 ) = LUr-M(-fir, s). 

This definition is justified by the following lemmas 12.11 and 12.21 

Lemma 2.1. For every (K, S) and (K', S'), if K < K' and S C S' , then 
M(k, s) Q M(k>, S')- 

Proof of Lemma 12.11 : Let P G M.{k, s) and let 9 = (tt, a, (3) G 0^ S ) 
be the parameter defining P. Let for instance define 9' = (tt', a', (3') G 
Q(k+i, s) as follows 



■Ki 



7T fc , k = 1, 



K- 1 



7T 



^ > and vr^ +1 > such that n' K 



71 



K+l 



7T K 



a 



/3' 



(3. 



K 



Then we have 9' G 0(x+i, 5) and P = P( 



{K+l, s) 



9') G M iK +i, s). 



We have just showed that M.(k, s) Q •M.(k+i, s) and there remains to 
show that M.(k, s) Q -M(k, s 1 ) f° r every S and S' such that S C 5". For such 
non empty subsets S and S" of available loci, the parameter space Q(k, s) 
can be regarded as a subset of Q(k, s 1 ) defined by the following equations : 



<*i,i, 



«jO, ■ V/ G 5" \ S. 



(26) 



Lemma 2.2. For every K 1} K 2 G {1, . . . , K max } and Si, S 2 G V* (L), we 
haveM( Kl , s^M^, s 2 ) = M( Ki ak z , SinS 2 ); whereK 1 AK 2 = mm{K 1 ;K 2 }. 



Proof of Lemma 12.21 : Let P be a probability distribution in A4(k 1} si) H 
A^(A' 2 , 5 2 )- Then for every a; in Af, P (x) is given by the following two equa- 
tions. 



P(x] 



P(x) 



A'l 



E^n p (^i k i, •)) 

,k=i leSi 



>< n p (^i to. ( 2? ) 



fe=i ;gs 2 



x 



n^w (a 2 .)). (28) 
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Assume without lost of generality that K\ < K 2 and denote A := Si \ (Si fl 
S 2 ), B := S 2 \ (Si fl 5 2 ) and C = L \ Si U S , 2 . Using equation (12T1) . the 
marginal probability distribution of the sub- vector x 52 := {x l ) leS is given 
by 



P (x S2 ) 



A* i 



_fc=i zeSinS2 
which using equation (1281) becomes 

^= x>* n p ( xi \ Wi,-)) 

.fc=i zeSinS 2 

< II p M (A 2 .)) 

!>* n p w k i,-)) 

.fc=i ie5in5 2 
which implies that P G A^^a^, Sin5 2 ) ■ 



x 



ipm k •)) 



(29) 



leB 



U p (- l \ (#■)) 



n p (^i (o 



zgAubuc* 



Obviously, by definition 12.11 f° r every Pi and P 2 in V, 

(Pi = P 2 ) =S- [tf (Fx) = (P 2 ) and 5 (PO = 5 (P 2 )} . 



(30) 



We will denote (K , S ) := (i^(P ), S(P )), and this definition is well 
compatible with the use of the BIC criterion which aims at selecting the 
smallest model dimension in statistical adjustment. 

2.4-2. Identifiability of parameter 7 = (n, a) in the model M(k, s) 

The classical definition of an identifiable model M.(k, s) of probability 
distributions requires that for any two different parameter values 9 and 
6' in parameter space Oik, S), the corresponding probability distributions 
P{k, S) ( ' I &) an d P(k, s) ( " I 0') be different. This is to require injectivity 
of the parameterization map \I/ for this model, which is defined by ^ (6) = 

P(K, S) (-\e). 

In our context of finite mixtures, the above map will not strictly be injec- 
tive because the latent classes can be freely relabeled without changing the 
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distribution underlining the observations. This is known as 'label swapping'. 
In such the above map is always at least .fTI-to-one. 

If the model is identifiable up to label swipping, then the number of 
independent parameters is at most equal to the number of distinct genotypes 

*-i+*£, ES ^-^n((X)+4)-i- < 3i > 

Despite that this condition is not sufficient, it gives an upper bound on 
K max = maxs K (S) of the number of populations where 

n MA l+ l) 

K (S) := n !^ s ,\ -. (32) 

We use that upper bound to define the collection of models in competition 
given by equation (JHJ). 

Assume that the frequencies of the distinct observed genotypes are the pa- 
rameters of interest. For a given K and S, we refer to the finite mixture model 
(CD) as the .fT-class, 1 5* (-feature model, with state space Ylies {!>••• j Q}, as 
Ai (K ; (Gi) leS ), where Gi = A '^' +1 - > is the number of distinct observed 
genotypes at locus I and \S\ the cardinality of S. 

Elizabeth S. Allman et al. (2008) [1] has proved that finite mixtures of 
multinomial distributions are generically identifiable. In the case of paramet- 
ric setting, 'generic' means that the set of points for which identifiability does 
not hold has zero-measure. Here is the result of Elizabeth S. et al. relevant 
to our setting. 

Theorem 2.1. Consider the model M. (K ; (Gi) leS ) where \S\ > 3. As- 
sume there exists a tripartition of the set S into three disjoint non-empty 
subsets Si, S 2 and S 3 , such that if Gi = Ylies x ^ en 

min (K, g x ) + min (K, Q 2 ) + min (K, G 3 )>2-K + 2. (33) 

Then the model is generically idenfiable, up to label swapping. Moreover, the 
statement remains valid when the proportions of the groups {7ik}k=i k are 
held fixed and positive. 

This result implies that one needs a minimum of genetic variability to 
guarantee the identifiability of the models in competition. For example, it 
will be difficult to detect 4 subpopulations with 3 biallelic loci such as Single 
Nucleotide Polymorphims (SNP). 
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3. Consistency 



In this section, it is proved that the probability of selecting the true 
model (Kq, Sq) by maximizing criterion ffl2|) tends to 1 as n — > oo under the 
following single assumption: 



(H) : MueX, P (u) > 0, 



(34) 



where X is the set of distinct genotypes defined by the observed alleles, and 
Po the true probability distribution of the observations. Assumption (H) 
is realistic because our method is proposed for experiments in which only 
observed alleles are considered. 



Theorem 3.1. Under assumption (H), 



lim Po 

n^oo 



K n , S n ) = (K , Sq) 



1. 



Proof of Theorem 13.11 : 



We need to prove that lim^ooPo \K n , S n ) ^ (K , St 



0. 



Pn 



K n , S n ) 7^ (Kq, S 



< £ P" 

(K, S)jt(K , S ) 



K n , S n ) = (K, S) 



so that since the number of possible (K, S) is finite, the theorem is proved 
if for every (K, S) ^ (K , S ), lim^ooPo [K n , S n ) = (K, S) = 0. 



Let (K, S) G {1, . . . , K max } x V* (L) such that (K, S) ^ (K , S ). We 
have 



P 



(X, S n ) = (K, S)] < P 



-2 sup e n (p {Ko , So ){-\o))>C 



2 sup t n ( P (Xj 5) 

"£©(K, S) 



So) 



Inn 



(35) 



where d^K, s) is the number of independent parameters of model M.(k, s), 
and 

£ n (P) = J2 n u\nP(u) 
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is the log-likelihood. Here n u is the number of individuals in the sample with 
genotype u. Two cases are considered : P G M-{k, s) and P ^ M-{k, s)- 

• Case 1 : P G M {K , sy 

Let T> (X) denote the set of all probability distributions on the set X of 
distinct observed genotypes. Since M.(k, s) C V(X), 

L (Po) < SUp 4 (p {K> S )(-\0))< SUp l n (P) , 

6aQ>(K, s) V / Pev(x) 

so that 

o< sup e n (p {KtS )(-\o))-£ n (p )< sup mp)-MPo). 

eee (Ki s) V / PeD(A') 

But it is well known that 2 sup PeV ( X ) i n (P) — 2£ n (P ) converges in distri- 
bution to a chi-square variable with \X\ — 1 numbers of freedom, where \X\ 
denote the cardinality of X. Also, if P G -M(x, 5) and (K, S) ^ (K , S ), 
d(K, s) — d(K , So) > 0- Thus in this case 



lim P 



2 sup £ n (p {k , S ) ( • | 0) )~2 sup £ n (p {Ko , s ) ( • I 0) 

9 ee (if , s) V / 0£Q(k o , s 0) V 



>( d {K} S ) ~ d {Ko , So) I m ™ 



= 0. 



• Case 2 : P ^ 5) 

For every S > 0, let 

S) = {^ ©(X, 5) : G P (x , 5) (x| 
The key point is the following: 

Proposition 3.1. Under assumption (H), there exists a real 5 > such that 
for every (K, S), 



sup -l n (p {K , S ) ( • I 0) ) = sup -4 f P (Ki S ) ( • | 9) ) + o Po (1) . 

9ee (if , s) n V / eee[ K s) n \ J 

(36) 
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Proof of Proposition 13.11 : 



-In f(if,S)(- 



7 we* 



S) 



PoH+OPo(l) 



X lnP(K, 5) H 



(37) 



The set of 8 > such that Q S , K g s 7^ 0. Let <5 > be such a real and 6* 



an 



element of Q, K s y Since for any u, Pq(u) > 0, using ([3j 

^n(P(^ S) (- |#)) >^P (x)ln5 + o Po (l) = ln5 + op (l). (38) 



Let 8 be a real such that < 8 < 5 int ^x p oM and 8 < inf ue ^ P (u). Remark 
that < inf ug ^- Pq (u) < 1 and < 8 < 1 imply ln<5 < In 8, so that 



< 8 < <5 inf 5 P oW < s. 

Then 0^ ^ C Q S , K and thus 



sup — 



'n I P(iC, 5) 



> sup 



' n \ P(K, S) 



\0) 



If now 6 G 0(^ 5) \ O s ^ K m, then there exists a genotype uj 6 ^ such 
that P(A', 5) #) < 8. In such a case 

1 

n 



— C I P(K, S) 



< infPo(u) In 5 + o Po (1) 



,Y 



< inf P (m) ln5 inf * p oM + OPo (1) = In 5 + o Po (1) 

< sup -e n (p iK)S) (-\9))+o Po (i) 

^u, s) n v J 

< sup -4(P(K,s)(- |0))+ Ofb (l) 
see* „, n V / 



'(K, S) 



Thus, 



SUp -£ n I P ( ^ 5) ^ • 

0£©(k, s) n 



)= sup —£ n (P(K, S)(' \ 0)) + op (1) 
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which is the desired result. ■ 



Now, the set of functions {lnP^ S ) ( • I Q) > ® £ ®\k S)} * s obviously 



Glivenko-Cantelli, so that 



sup -4 P( X , 5) 



9ee° 



n 



sup P P() 



eee° 



'(if, S) - ' "^(K, S) 

and Proposition 13. II yields, for any (K, S), 



SUp -4 P( X , 5) 

:e ( . 
Also, 



sup E Po 



e ^U, S) 



sup P 



Po 



lnP (Xoi5o) (f/| 0) 



lnifa S) (C/| 6) 



E Po lnP (U) 



+ OP (1) 



+ o Po (1) . 



since Po G A4(ic 0; s ) an d Po (w) > 5 Wu E X . Thus 

- SUp £n\P(K, S) ( • I &) J SUp £n(P(K . So) 

- inf P Po lnP (f/)-lnP (K ,5)(f/| 0) + o Po (1) . 

0g0 (k, S) L 

But on the compact set Q, K ™, the function i— > E Po In P (U)— In P(^ 5 ) (P| 

is continuous and attains its infimum at a point But since Po ^ -^(x, 5), 
P (-)^P { K, s) (-| #),and 



P 



Po 



lnP (P)-lnP (X) s) (U\ 6) 



> 0. 



Noticing that lim™ (V. S )-<Ws o) )inn = ^ ^ gets 



lim P 



2 sup 4 ( P (K , 5) 



2 sup 4 ( P (if0) So) 



> I d( K . ff\ — d 



(K, S) ~ U>(K , S ) 



In n 
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4. Simulation examples 

MixMoGenD has been implemented using C + + and C programming 
languages. The main goal of the simulation examples was to confirm in 
practice the consistency of the loci selection procedure and to highliht its 
benifits. Before that, preliminary simulations were conducted to regulate 
certain known problems of the EM algorithm, in particular convergence to- 
wards the maximum likelihood and the low speed of convergence in certain 
cases. In fact, the EM algorithm converges almost always towards a lo- 
cal maximum under certain conditions of regularity. Thus it is not certain 
whether the algorithm converges towards a local or global maximum when 
there are several maxima. To reduce the dependence of the convergence point 
to the initial parameter of the algorithm, we opt for the strategy of at least 
50 initial parameters, and the maximum likelihood estimate is the one maxi- 
masing the likelihood. For each initial parameter, we stop the EM algorithm 
when the difference between two consecutive likelihoods of the complete data 
is less than a certain positive real e > to be chosen by the user. 

4-1. Consistency of the selection procedure 

The goal here was to see how the increase of the size of the sample 
improves the capacity of our clustering method to select the true model 
A4(K , s )- An interface between MixMoGenD and R was created for these 
simulations. In these experiments, we started with n = 100 individuals, 
and gradually increased this number to 400 by a step of 50. We assumed 
K = 2 populations, L = 4 loci with 2 alleles by locus, Sq with cardinality 
\S \ = 2. For each value n of the sample size, 100 data sets were generated. 
The parameters of simulation are given in Table [TJ The figure [T] shoes that 
MixMoGenD consistently identify the true model as n —>■ oo. Other simu- 
lated data with K = 3, |Sb| = 4 and |Sq| = 2 confirmed these results. These 
results confirm the theoretical result on the consistency that we showed in 
Section [3J 

4-2. Benefits of the selection procedure 

Two series of simulations were conducted to highlight the importance of 
the loci selection procedure. First, we independently generated 100 data sets, 
each of them contained 1 000 individuals. We assumed K = 3 populations 
with the proportions given by 7r = (0.20, 0.30, 0.50), L = 6 loci with the 
numbers of alleles given by (3, 4, 3, 3, 3, 4), Sq = {1, 2, 3, 4} and allele 
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frequencies given in Table [2J Using all the 6 loci, the true model was selected 
39 times against 61 for the model with K n = 2. When including the selection 
procedure, MixMoGenD selected the true model (K , So) 90 times against 
10 for (K, S) = (2, So). It appears that the number of populations is under 
estimated when considering all available loci as relevant for clustering. 

To confirm this result, a second series of simulations with more variabil- 
ity was conducted. In these simulations, each of the data sets consisted of 
1 000 individuals structured into 5 subpopulations of equal proportions. We 
assumed L — 10 loci each with 10 alleles, and two different cardinalities 
for Sq: 8 and 6. Instead of choosing manualy the allelic frequencies, we 
adopt the following strategy. For the loci in So, we first use the program 
EASYPOP [5j to simulate some data sets at some levels of Fst between 
0.03 and 0.04. Second, we estimated the allelic frequencies of the loci in S 
by EM algorithm. And thirdly, we used these estimates and uniform prob- 
ability distribution on loci in Sq to simulate the data sets with R program 



17j . Sample project files and their simulation parameters are available on 



http : / /www . math . u-psud/~toussile , 



Results 

As expected, the results in the Table [3] show that the integrated loci 
selection procedure significantly improves the inference on the number K of 
subpopulations and the quality of the prediction. The benefit of the selection 
is more important with the increase of cardinality of the subset Sq of loci 
that are not relevant for clustering. The important result is that for these 
simulations, MixMoGenD perfectly selected the true subset So of relevant 
loci for clustering. For each data set for which K n < K , we calculated 
the square matrix of the pairwise F ST between individuals sampled using the 



function Fstat of the package Geneland [llj of R program. We observed that 
there exists a threshold FsT max of pairwise Fst for which two subpopulations 
with Fst < -Fsr max are clustered together. This threshold are approximately 
0.027 on the simulated data sets we used. The more striking example is the 
data set 5 in Table [3] (d). The square matrix of pairwise F S t is given in Table 
HJ The Fst between population 4 and the others are all < 0.026. On this 
data set, MixMoGenD produces 4 clusters and we observed that Pop4 was 
uniformly distributed in the 4 clusters. 
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5. Discussion 



We believe that MixMoGenD will be useful for two main reasons. First, 
like FASTRUCT, MixMoGenD is based on the EM algorithm, so that 
both share certain qualities, particul arly they are faster than their counter- 



parts based on a bayesian approach [10 



The key point of our proposed method is that it is combined with a loci 
selection procedure. That is the main reason for which our method will be 
very useful, and it is our main contribution. In fact, the results obtained on 
simulated data show how the selection procedure improves significantly the 
inference on the number K of subpopulations and the prediction capacity. 
In addition, due to the explosion of genomic projects, data sets are becoming 
increasingly large. The space of the models in competion can then be very 
large. Then an exhaustive research of an optimum model is very painful 
in most situation and could not be achieved by methods based on MCMC 
algorithm as mentioned by 0. Francois et al. (2006) jfjj]. Thus methods like 
frequentist likelihood methods using EM algorithm will then become useful 
because they require much shorter computations than the methods based 
on MCMC algorithm. For example, E. K. Latch (2005) [l3| reported that 
a data set with 5 subpopulations, 100 individuals in each subpopulation, 
10 loci and 10 alleles by locus take approximately 3 h to run without loci 



selection on STRUCTURE 15], and 30 h on PARTITION (all times 
provided are appropriate for a computer with a 2.2 GHz Celeron processor 
and 512 MB of RAM). For such data sets, MixMoGenD and its selection 
procedure take approximately 2 h 30 to run. This was made possible thanks 
to the Backward- Stepwise algorithm, which enabled us to avoid an exhaustive 
research of the optimum model among all the models in competition. 
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Locus 


Allele 


Popl 


Pop2 


Locus 


Allele 


Popl 


Pop2 


1 


1 


0.70 


0.25 


3 


1 


0.85 


0.85 




2 


0.30 


0.75 




2 


0.15 


0.15 


2 


1 


0.35 


0.70 


4 


1 


0.50 


0.50 




2 


0.65 


0.30 




2 


0.50 


0.50 



Table 1: Parameters of simulated data to show the consistency of the selection 
procedure. K = 2, S = {1, 2}, vr = (0.30, 0.70). 



L 


Allele 


Popl 


Pop2 


Pop3 


L 


Allele 


Popl 


Pop2 


Pop3 


1 


1 


0.20 


0.40 


0.50 


4 


1 


0.30 


0.40 


0.65 




2 


0.30 


0.40 


0.20 




2 


0.60 


0.40 


0.15 




3 


0.50 


0.20 


0.30 




3 


0.10 


0.20 


0.20 


2 


1 


0.20 


0.40 


0.50 


5 


1 


0.25 


0.25 


0.25 




2 


0.20 


0.40 


0.10 




2 


0.30 


0.30 


0.30 




3 


0.40 


0.10 


0.10 




3 


0.25 


0.25 


0.25 




4 


0.20 


0.10 


0.30 




4 


0.20 


0.20 


0.20 


3 


1 


0.15 


0.25 


0.50 


6 


1 


0.40 


0.40 


0.40 




2 


0.25 


0.25 


0.10 




2 


0.30 


0.30 


0.30 




3 


0.60 


0.50 


0.40 




3 


0.30 


0.30 


0.30 



Table 2: Parameters of simulated data to show the benefit of the selection pro- 
cedure: K = 3, 7T = (0.20, 0.30, 0.50), So = {1, 2, 3, 4}. L = locus, 
Pop=Population 
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5 
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4 




4 




6 


5 


10.20 


5 


10.30 


7 


3 




3 




7 


5 


09.10 


4 




8 


3 




3 




8 


5 


07.60 


5 


08.50 


9 


5 


11.80 


3 




9 


5 


09.50 


4 




10 


3 




3 




10 


5 


10.30 


5 


10.90 


fa) 










(b) 










Data 


K n 


07 MA 

/o MA 




07 MA 

/o MA 


Data 


K n 


07 MA 

/o MA 




07 MA 
70 MA 
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5 


07.50 


5 


07.10 
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5 


14.80 


2 




2 


5 


05.40 


5 


r\ r* on 

06.30 
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5 
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06.50 
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06.70 
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13.40 
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05.90 
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05.90 
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13.60 
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06.70 
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07.20 
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4 




2 
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05.60 
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06.10 
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5 


14.30 


1 
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06.60 
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07.10 
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15.10 
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05.70 
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05.50 
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5 


13.90 


2 
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06.80 
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07.20 
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14.70 


2 




10 


5 


06.30 


5 


06.10 


10 


5 


15.20 


1 




(c) (d) 



Data K s n % MA K n % MA 



1 


5 


10.60 


3 


2 


5 


11.30 


4 


3 


5 


09.70 


3 


4 


5 


09.60 


4 


5 


5 


11.00 


4 


6 


5 


10.50 


4 


7 


5 


09.80 


4 


8 


5 


10.70 


4 


9 


5 


11.50 


4 


10 


5 


12.50 


3 



(e) 

Table 3: For all these simulations, L = 10 and K = 5. In the tables (a), (b) and 
(c), we assumed \S\ = 8 and the F$t for the loci in S were 0.0304, 0.0355 and 
0.0407 respectivelly. As expected, the rease of F$t for the loci in S improves 
the performances of MixMoGenD. In the Tables (d) and (e), we assumed \S\ =6, 
and the difference between running MixMoGenD with or without selection is clear. 
MA = Misassigned, and K n are the estimates of the number of populations 
with and without loci selection resprctively. 
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Pop4 


Pop5 


Popl 


0.00000000 


0.04112990 


0.03024947 


0.02425668 


0.03535726 


Pop2 


0.04112990 


0.00000000 


0.03831558 


0.02255300 


0.02756619 


Pop3 


0.03024947 


0.03831558 


0.00000000 


0.02255183 


0.03251246 


Pop4 


0.02425668 


0.02255300 


0.02255183 


0.00000000 


0.02509488 


Pop5 


0.03535726 


0.02756619 


0.03251246 


0.02509488 


0.00000000 



Table 4: The F$t between population 4 and the others are all < 0.026. Mix- 
MoGenD on this data set produces 4 clusters and we observed that Pop4 was 
uniformly distributed in the 4 clusters. 



% of selecting the true model (K_0=2, S_0={1, 2}) 
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n=number of indivuals in the overall population 

Figure 1: % of selecting the true model vs number of individuals 
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